
    Automatic Accentedness Evaluation of Non-Native Speech Using Phonetic and Sub-Phonetic Posterior Probabilities

    Automatic evaluation of non-native speech accentedness has potential implications not only for language learning and accent identification systems but also for speaker and speech recognition systems. From the perspective of speech production, the two primary factors influencing accentedness are phonetic and prosodic structure. In this paper, we propose an approach for automatic accentedness evaluation based on comparing instances of native and non-native speakers at the acoustic-phonetic level. Specifically, the proposed approach measures accentedness by comparing phone class conditional probability sequences corresponding to the instances of native and non-native speakers, respectively. We evaluate the proposed approach on the EMIME bilingual and EMIME Mandarin bilingual corpora, which contain English speech from native English speakers and from non-native English speakers whose native languages are Finnish, German and Mandarin. We also investigate the influence of the granularity of the phonetic unit representation on the performance of the proposed accentedness measure. Our results indicate that the accentedness ratings produced by the proposed approach correlate consistently with human ratings of accentedness. In addition, our studies show that the granularity of the phonetic unit representation that yields the best correlation with the human accentedness ratings varies with the native language of the non-native speakers.
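    Since native and non-native utterances differ in length, one standard way to compare two phone-posterior sequences is dynamic time warping (DTW) with a symmetrised KL divergence as the local distance. The sketch below illustrates that idea only; the paper's actual alignment and distance choices may differ, and the function names are hypothetical.

```python
import math

def kl_sym(p, q, eps=1e-8):
    """Symmetrised KL divergence between two posterior vectors."""
    d = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi + eps, qi + eps
        d += pi * math.log(pi / qi) + qi * math.log(qi / pi)
    return d

def dtw_accent_score(native, nonnative):
    """Align two phone-posterior sequences with DTW and return the
    total divergence along the best path, normalised by length: a
    simple accentedness score (higher = further from native)."""
    n, m = len(native), len(nonnative)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = kl_sym(native[i - 1], nonnative[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a native frame
                                 cost[i][j - 1],      # skip a non-native frame
                                 cost[i - 1][j - 1])  # match both frames
    return cost[n][m] / (n + m)

# identical posteriors score 0; a flat (uncertain) sequence scores higher
native = [[0.9, 0.1], [0.1, 0.9]]
flat = [[0.5, 0.5], [0.5, 0.5]]
```

    The granularity question in the abstract corresponds to the dimensionality of the posterior vectors: phone classes give short vectors, sub-phonetic units longer ones.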

    Exploiting un-transcribed foreign data for speech recognition in well-resourced languages

    Manual transcription of audio databases for automatic speech recognition (ASR) training is a costly and time-consuming process. State-of-the-art hybrid ASR systems that are based on deep neural networks (DNN) can exploit un-transcribed foreign data during unsupervised DNN pre-training or semi-supervised DNN training. We investigate the relevance of foreign data characteristics, in particular domain and language. Using three different datasets of the MediaParl and Ester databases, our experiments suggest that domain and language are equally important. Foreign data recorded under matched conditions (language and domain) yields the most improvement. The resulting ASR system yields about 5% relative improvement compared to the baseline system trained only on transcribed data. Our studies also reveal that the amount of foreign data used for semi-supervised training can be significantly reduced without degrading ASR performance if confidence-measure-based data selection is employed.
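    Confidence-measure-based data selection can be sketched as a simple filter over automatically transcribed utterances: only hypotheses whose decoder confidence clears a threshold are kept for semi-supervised training. The threshold value and the tuple layout below are illustrative assumptions, not the paper's actual configuration.

```python
def select_for_semisupervised(decoded, threshold=0.8):
    """Keep only automatically transcribed utterances whose ASR
    confidence clears the threshold; low-confidence hypotheses are
    dropped rather than risk training on bad labels."""
    return [(utt, hyp) for utt, hyp, conf in decoded if conf >= threshold]

# hypothetical decoder output: (utterance id, hypothesis, confidence)
decoded = [
    ("utt1", "mesdames et messieurs", 0.93),
    ("utt2", "grüezi mitenand", 0.41),
    ("utt3", "le budget cantonal", 0.86),
]
selected = select_for_semisupervised(decoded, threshold=0.8)
```

    Raising the threshold trades quantity of added data for label quality, which is why the selected subset can be much smaller than the full foreign set without hurting performance.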

    A Just-in-Time Document Retrieval System for Dialogues or Monologues

    The Automatic Content Linking Device is a just-in-time document retrieval system that monitors an ongoing dialogue or monologue and enriches it with potentially related documents from local repositories or from the Web. The documents are found using queries that are built from the dialogue words, obtained through automatic speech recognition. Results are displayed in real time to the dialogue participants, or to people watching a recorded dialogue or a talk. The system can be demonstrated in both settings.
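    Building queries from dialogue words can be sketched as stopword filtering plus frequency ranking over a window of recent ASR output. The real system's query construction is certainly more elaborate; the stopword list and function name here are assumptions for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "we", "uh", "um"}

def build_query(asr_words, top_k=5):
    """Build a retrieval query from a window of recent ASR words:
    drop stopwords, then keep the most frequent content words."""
    content = [w.lower() for w in asr_words if w.lower() not in STOPWORDS]
    return [w for w, _ in Counter(content).most_common(top_k)]

words = "the meeting about the budget and the budget report".split()
```

    The resulting term list would then be sent as a query to the local repository index or a web search API at regular intervals.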

    Comparative Study on Sentence Boundary Prediction for German and English Broadcast News

    We present a comparative study on sentence boundary prediction for German and English broadcast news that explores generalization across different languages. In the feature extraction stage, word pause duration is first extracted from word-aligned speech, and forward and backward language models are utilized to extract textual features. Then a gradient boosting machine is optimized by grid search to map these features to punctuation marks. Experimental results confirm that word pause duration is a simple yet effective feature for predicting whether there is a sentence boundary after that word. We found that the Bayes risk derived from the pause duration distributions of sentence boundary words and non-boundary words is an effective measure for assessing the inherent difficulty of sentence boundary prediction. The proposed method achieved F-measures of over 90% on reference text and around 90% on ASR transcripts for both the German broadcast news corpus and the English multi-genre broadcast news corpus, demonstrating state-of-the-art performance.
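    The Bayes risk measure derived from the two pause duration distributions can be sketched from histogram estimates: it is the probability mass on which even the optimal decision rule is still wrong. The binning and the uniform class prior below are illustrative assumptions.

```python
def bayes_risk(boundary_hist, nonboundary_hist, prior_boundary=0.5):
    """Bayes risk of predicting 'sentence boundary' from pause
    duration alone, using histogram estimates of the two
    class-conditional pause duration distributions."""
    bins = set(boundary_hist) | set(nonboundary_hist)
    nb = sum(boundary_hist.values())
    nn = sum(nonboundary_hist.values())
    risk = 0.0
    for b in bins:
        p_b = prior_boundary * boundary_hist.get(b, 0) / nb
        p_n = (1 - prior_boundary) * nonboundary_hist.get(b, 0) / nn
        risk += min(p_b, p_n)  # mass assigned to the losing class
    return risk

# fully separated pause distributions: prediction is easy, risk 0
easy = bayes_risk({0.8: 10}, {0.05: 10})
# identical distributions: pause carries no information, risk 0.5
hard = bayes_risk({0.0: 5, 0.5: 5}, {0.0: 5, 0.5: 5})
```

    A corpus with long, consistent pauses at sentence ends yields a low risk and is inherently easy; a corpus where pauses overlap between classes yields a risk near the prior and is inherently hard.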

    Automatic Speech Indexing System of Bilingual Video Parliament Interventions

    This paper presents the development and evaluation of an automatic audio indexing system designed for a specific task: working in the bilingual environment of the Parliament of the Canton of Valais in Switzerland, which has two official languages, German and French. As several speakers are bilingual, language changes may occur within a speaker's turn or even within an utterance. Two audio indexing approaches are presented and compared: in the first, speech indexing is based on bilingual automatic speech recognition; in the second, language identification is applied after speaker diarization in order to select the corresponding monolingual speech recognizer for decoding. The approaches are later combined. Speaker adaptive training is also addressed and evaluated. The accuracy of language identification and speech recognition for the monolingual and bilingual cases is presented and compared, in parallel with a brief description of the system and the user interface. Finally, the audio indexing system is also evaluated from an information retrieval point of view.
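    The second approach (language identification after diarization, selecting a monolingual recognizer per segment) can be sketched as a small dispatch loop. The `identify_language` and recognizer callables below are toy stand-ins for the real LID and ASR components.

```python
def transcribe_intervention(segments, identify_language, recognizers):
    """Sketch of the LID-then-decode approach: after diarization, run
    language identification on each speaker segment and dispatch it
    to the matching monolingual recognizer."""
    results = []
    for segment in segments:
        lang = identify_language(segment)  # e.g. "de" or "fr"
        results.append((lang, recognizers[lang](segment)))
    return results

# toy stand-ins for the real LID and ASR components
identify_language = lambda s: "de" if "ü" in s else "fr"
recognizers = {"de": str.upper, "fr": str.lower}
output = transcribe_intervention(["grüezi mitenand", "Bonjour à tous"],
                                 identify_language, recognizers)
```

    Because language can switch within a speaker's turn, diarization segments are the natural dispatch unit here, whereas the first approach handles code-switching inside a single bilingual recognizer.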

    Empirical Evaluation and Combination of Punctuation Prediction Models Applied to Broadcast News

    Natural language processing techniques depend on punctuation to work well. When their input is taken from speech recognition, it is necessary to reconstruct the punctuation, in particular sentence boundaries. We define a range of features, from low-level acoustics to those with high-level lexical semantics, including deep and recurrent models; these in turn are representative of a broad range of approaches used by previous authors for punctuation prediction. We combine the features using a gradient boosting machine that is also capable of indicating the relative importance of each feature. In an empirical study, we show that features from different semantic levels are in fact complementary, that combining statistical and deep learning methods yields better prediction results, and that generalization across different speaking styles is difficult to achieve without adaptation. Our best model achieves an F-measure of 82.8 on a challenging broadcast news dataset.
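    The paper reads relative feature importances off the gradient boosting machine itself. A model-agnostic stand-in that conveys the same idea is permutation importance: a feature's importance is the accuracy lost when its column is shuffled across examples. This is a sketch of that substitute technique, not the paper's method.

```python
import random

def accuracy(predict, X, y):
    """Fraction of examples the classifier labels correctly."""
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_features, seed=0):
    """Importance of each feature, measured as the accuracy lost
    when that feature's column is shuffled across examples."""
    rng = random.Random(seed)
    base = accuracy(predict, X, y)
    importances = []
    for f in range(n_features):
        col = [x[f] for x in X]
        rng.shuffle(col)
        X_perm = [x[:f] + (v,) + x[f + 1:] for x, v in zip(X, col)]
        importances.append(base - accuracy(predict, X_perm, y))
    return importances

# toy data: feature 0 is an acoustic cue that determines the label;
# feature 1 is noise the classifier ignores
X = [(0, 1), (1, 0), (0, 0), (1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]
y = [0, 1, 0, 1, 0, 1, 0, 1]
imp = permutation_importance(lambda x: x[0], X, y, n_features=2)
```

    In either formulation, an importance near zero flags a feature the model does not rely on, which is how complementarity across semantic levels can be diagnosed.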

    Open-Vocabulary Keyword Spotting with Audio and Text Embeddings

    Keyword Spotting (KWS) systems detect a set of spoken (pre-defined) keywords. Open-vocabulary KWS systems search for the keywords in the set of word hypotheses generated by an automatic speech recognition (ASR) system, which is computationally expensive and, therefore, often implemented as a cloud-based service. Alternatively, KWS systems can use word classification algorithms, which do not allow easily changing the set of words to be recognized, as the classes have to be defined a priori, even before training the system. In this paper, we propose the implementation of an open-vocabulary, ASR-free KWS system based on speech and text encoders whose computed embeddings are matched in order to spot whether a keyword has been uttered. This approach allows choosing the set of keywords a posteriori while requiring low computational power. The experiments, performed on two different datasets, show that our method is competitive with other state-of-the-art KWS systems while allowing for flexible configuration and being computationally efficient.
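    Matching audio and text embeddings to spot a keyword can be sketched as a cosine similarity test against a threshold. The embedding dimensionality, threshold value, and toy vectors below are stand-ins for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def spot_keywords(audio_embedding, keyword_embeddings, threshold=0.7):
    """Return every keyword whose text embedding lies close enough
    to the audio embedding of the current speech segment."""
    return [kw for kw, emb in keyword_embeddings.items()
            if cosine(audio_embedding, emb) >= threshold]

# toy 2-d embeddings standing in for real encoder outputs
keyword_embeddings = {"lights on": (1.0, 0.0), "stop": (0.0, 1.0)}
hits = spot_keywords((0.9, 0.1), keyword_embeddings)
```

    Because adding a keyword only requires running the text encoder once, the keyword set can be changed a posteriori without retraining, which is the flexibility the abstract emphasizes.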

    The ACLD: Speech-based Just-in-Time Retrieval of Multimedia Documents and Websites

    The Automatic Content Linking Device (ACLD) is a just-in-time retrieval system that monitors an ongoing conversation or a monologue and enriches it with potentially related documents, including transcripts of past meetings, from local repositories or from the Internet. The linked content is displayed in real time to the participants in the conversation, or to users watching a recorded conversation or talk. The system can be demonstrated in both settings, using real-time automatic speech recognition (ASR) or replaying offline ASR, via a flexible user interface that displays results and provides access to the content of past meetings and documents.